NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

SCV-GNN: Sparse Compressed Vector-based Graph Neural Network Aggregation

https://doi.org/10.1109/TCAD.2023.3291672

Unnikrishnan, Nanda K.; Gould, Joe; Parhi, Keshab K. (July 2023, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Graph neural networks (GNNs) have emerged as a powerful tool to process graph-based data in fields like communication networks, molecular interactions, chemistry, social networks, and neuroscience. GNNs are characterized by the ultra-sparse nature of their adjacency matrix that necessitates the development of dedicated hardware beyond general-purpose sparse matrix multipliers. While there has been extensive research on designing dedicated hardware accelerators for GNNs, few have extensively explored the impact of the sparse storage format on the efficiency of the GNN accelerators. This paper proposes SCV-GNN with the novel sparse compressed vectors (SCV) format optimized for the aggregation operation. We use Z-Morton ordering to derive a data-locality-based computation ordering and partitioning scheme. The paper also presents how the proposed SCV-GNN is scalable on a vector processing system. Experimental results over various datasets show that the proposed method achieves a geometric mean speedup of 7.96× and 7.04× over CSC and CSR aggregation operations, respectively. The proposed method also reduces the memory traffic by a factor of 3.29× and 4.37× over compressed sparse column (CSC) and compressed sparse row (CSR), respectively. Thus, the proposed novel aggregation format reduces the latency and memory access for GNN inference.
more » « less
Full Text Available
InterGrad: Energy-Efficient Training of Convolutional Neural Networks via Interleaved Gradient Scheduling

https://doi.org/10.1109/TCSI.2023.3246468

Unnikrishnan, Nanda K.; Parhi, Keshab K. (February 2023, IEEE Transactions on Circuits and Systems I: Regular Papers)

This paper addresses the design of accelerators using systolic architectures to train convolutional neural networks using a novel gradient interleaving approach. Training the neural network involves computation and backpropagation of gradients of error with respect to the activation functions and weights. It is shown that the gradient with respect to the activation function can be computed using a weight-stationary systolic array, while the gradient with respect to the weights can be computed using an output-stationary systolic array. The novelty of the proposed approach lies in interleaving the computations of these two gradients on the same configurable systolic array. This results in the reuse of the variables from one computation to the other and eliminates unnecessary memory accesses and energy consumption associated with these memory accesses. The proposed approach leads to 1.4−2.2× savings in terms of the number of cycles and 1.9× savings in terms of memory accesses in the fully-connected layer. Furthermore, the proposed method uses up to 25% fewer cycles and memory accesses, and 16% less energy than baseline implementations for state-of-the-art CNNs. Under iso-area comparisons, for Inception-v4, compared to weight-stationary (WS), Intergrad achieves 12% savings in energy, 17% savings in memory, and 4% savings in cycles. Savings for Densenet-264 are 18% , 26% , and 27% with respect to energy, memory, and cycles, respectively. Thus, the proposed novel accelerator architecture reduces the latency and energy consumption for training deep neural networks.
more » « less
Full Text Available
Multi-Channel FFT Architectures Designed via Folding and Interleaving

https://doi.org/10.1109/ISCAS48785.2022.9937347

Unnikrishnan, Nanda K.; Parhi, Keshab K. (May 2022, 2022 IEEE International Symposium on Circuits and Systems (ISCAS))

Computing the FFT of a single channel is well understood in the literature. However, computing the FFT of multiple channels in a systematic manner has not been fully addressed. This paper presents a framework to design a family of multi-channel FFT architectures using folding and interleaving. Three distinct multi-channel FFT architectures are presented in this paper. These architectures differ in the input and output preprocessing steps and are based on different folding sets, i.e., different orders of execution.
more » « less
Full Text Available
LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and Inter-Layer Gradient Pipelining and Multiprocessor Scheduling

https://doi.org/10.1109/ICCAD51958.2021.9643567

Unnikrishnan, Nanda K.; Parhi, Keshab K. (November 2021, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD))

The time required for training the neural networks increases with size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of the tasks within the layer and between consecutive layers. Prior approaches, such as PipeDream, have exploited the use of delayed gradient to achieve inter-layer pipelining. However, these approaches treat the entire backpropagation as a single task; this leads to an increase in computation time and processor underutilization. This paper presents novel optimization approaches where the gradient computations with respect to the weights and the activation functions are considered independently; therefore, these can be computed in parallel. This is referred to as intra-layer optimization. Additionally, the gradient computation with respect to the activation function is further divided into two parts and distributed to two consecutive layers. This leads to balanced scheduling where the computation time of each layer is the same. This is referred to as inter-layer optimization. The proposed system, referred to as LayerPipe, reduces the number of clock cycles required for training while maximizing processor utilization with minimal inter-processor communication overhead. LayerPipe achieves an average speedup of 25 % and upwards of 80% with 7 to 9 processors with less communication overhead when compared to PipeDream.
more » « less
Full Text Available
A Gradient-Interleaved Scheduler for Energy-Efficient Backpropagation for Training Neural Networks

https://doi.org/10.1109/ISCAS45731.2020.9181242

Unnikrishnan, Nanda; Parhi, Keshab K. (September 2020, 2020 IEEE International Symposium on Circuits and Systems (ISCAS))

This paper addresses design of accelerators using systolic architectures for training of neural networks using a novel gradient interleaving approach. Training the neural network involves backpropagation of error and computation of gradients with respect to the activation functions and weights. It is shown that the gradient with respect to the activation function can be computed using a weight-stationary systolic array while the gradient with respect to the weights can be computed using an output-stationary systolic array. The novelty of the proposed approach lies in interleaving the computations of these two gradients to the same configurable systolic array. This results in reuse of the variables from one computation to the other and eliminates unnecessary memory accesses. The proposed approach leads to 1.4–2.2× savings in terms of number of cycles and 1.9× savings in terms of memory accesses. Thus, the proposed accelerator reduces latency and energy consumption.
more » « less
Full Text Available
Brain-Inspired Computing: Models and Architectures

https://doi.org/10.1109/OJCAS.2020.3032092

Parhi, Keshab K.; Unnikrishnan, Nanda K. (January 2020, IEEE Open Journal of Circuits and Systems)
Effect of Finite Word-Length on SQNR, Area and Power for Real-Valued Serial FFT

https://doi.org/10.1109/ISCAS.2019.8702289

Unnikrishnan, Nanda K.; Garrido, Mario; Parhi, Keshab K. (May 2019, Proc. 2019 IEEE International Symposium on Circuits and Systems (ISCAS))

Modern applications for DSP systems are increasingly constrained by tight area and power requirements. Therefore, it is imperative to analyze effective strategies that work within these requirements. This paper studies the impact of finite word-length arithmetic on the signal to quantization noise ratio (SQNR), power and area for a real-valued serial FFT implementation. An experiment is set up using a hardware description language (HDL) to empirically determine the tradeoffs associated with the following parameters: (i) the input word-length, (ii) the word-length of the rotation coefficients, and (iii) length of the FFT on performance (SQNR), power and area. The results of this paper can be used to make design decisions by careful selection of word-length to achieve a reduction in area and power for an acceptable loss in SQNR.
more » « less
Full Text Available

Search for: All records